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UKQCD 's dynamical fermion project uses the Generalised Hybrid Monte Carlo (GHMC) algorithm to generate 
QCD gauge configurations for a non-perturbatively O(a) improved Wilson action with two degenerate sea-quark 
flavours. We describe our implementation of the algorithm on the Cray-T3E, concentrating on issues arising from 
code verification and performance optimisation, such as parameter tuning, reversibility, the effect of precision, 
the choice of matrix inverter, and the behaviour of different molecular dynamics integration schemes. 



1. Introduction 

First physics results from the UKQCD dynam- 
ical fermions project are presented elsewhere [Q; 
here we discuss issues arising from the implemen- 
tation of GHMC H in Fortran90 on the Cray- 
T3E, such as the optimisation of the code to take 
full advantage of the Cray-T3E architecture, and 
interesting aspects of the algorithm's behaviour 
demonstrated in the course of the code verifica- 
tion process. 

1.1. Notation 

Gauge fields are denoted U and their canoni- 
cally congugate momentum fields P. The HMC 
hamiltonian is 



n = t{p) + s g + s f 



a) 



where S g is the plaquette pure gauge action and 
the fermionic part of the action is 

S f = 0t(M f M)-V- 2 lndet ^ ( 2 ) 



Sf involves pseudofermion fields 4> and the odd- 
even preconditioned fermion matrix 



M-xy — A xx K D XZ A ZZ D zy 



(3) 



where D is the usual Wilson hopping matrix and 
A is the (Sheikoloslami-Wohlert) clover term. 



* Presented by Z. Sroczynski 



2. Code Improvements 

Since our original Fortran implementation, 
hardware specific code optimisation and algorith- 
mic improvements have seen the CPU time re- 
quired to complete one trajectory reduced by a 
factor of 10 to 20, depending on lattice size, f3 
and K sea - 

2.1. Cray-T3E Architectural Factors 

Performance on the Cray-T3E is dominated by 
memory bandwidth. The fact that the pro- 
cessors support multiple instruction issue coupled 
with the increased complexity of the memory sys- 
tem makes it difficult to take full advantage of the 
machine's capabilities with even highly optimised 
Fortran code, so key routines must be written in 
assembler. As a by-product of the tender process, 
we obtained a set of highly optimised fermion ma- 
trix multiplication routines. We have re-written 
a number of additional routines in assembler and 
now less than 25% of the run-time is spent exe- 
cuting Fortran code. 

Using 32-bit instead of 64-bit floating point 
numbers to represent the fields improves speed 
by a factor of 1.7. This must be weighed against 
degradation of the acceptance rate, the reversibil- 
ity and the accuracy of calculations of H (the 
correctness of the algorithm depends on the ac- 
curate computation of the difference of two ex- 
tensive quantities which are constrained to have 
nearly identical magnitudes - eventually, i.e. as 
the lattice volume approaches the reciprocal of 
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the 32-bit machine epsilon, the problem becomes 
intractable). We have taken special care here, 
computing energy differences site by site and per- 
forming summations in higher precision. 

2.2. Algorithmic Improvements 

Using BiCGStab || instead of Conjugate Gra- 
dient to perform the inversions yielded a k depen- 
dent saving of about 40%. 

A simple observation which does not seem to 
be widely known is that the vectors r and s in 
BiCGStab (in the notation of ||) can and should 
be overlapped in memory, saving one vector's 
worth of storage and two memory copies per it- 
eration. 

A further saving of one solve per trajectory can 
be made by observing that at the start of a tra- 
jectory the solution to M'Y — 4> is known ex- 
actly (we initialise the (j> fields by the heatbath 
4> = M*r} for Gaussian noise ij). 

There are special considerations when using the 
Clover action. A similarity transformation re- 
duces the 12 x 12 matrix to two 6x6 matrices ||]. 
This halves the memory requirement and doubles 
the speed of our inverse multiply routine. Instead 
of storing A^ 1 explicitly, we store the L^DL de- 
composition of A ee ; this saves memory, is just as 
efficient, more robust and makes the computation 
of detA (in Sf) trivial. 

Left-preconditioning the BiCGStab by A~^ re- 
duces the number of iterations required for con- 
vergence by some 15%. Using A" 1 as a central 
preconditioner with CG saves about 25% in terms 
of iterations. 

We do the exponentiation of the conjugate mo- 
menta (as required in (||)) exactly (up to ma- 
chine precision). Our first implementation ac- 
complished this by diagonalising the 3 by 3 matri- 
ces using library routines. We improved speed by 
a factor > 6 with a method || which exploits the 
Cayley-Hamilton theorem and requires the solu- 
tion to a Vandermonde system. 



3. Integration Schemes 

The numerical integration of U and P forward 
in "molecular dynamics time" r can be repre- 



sented || as an evolution operator T(dr) 



T(dr) 



U(r) 
P(r) 



U(t + dr) 
P(t + dr) 



(4) 



In general T is composed of two operators; Tjj 
and Tp where 



T P {dr) : U 
Tu(dT) : P 



JdrP 



U (5) 
P-idrduiSg + Sf) (6) 



Being a numerical integrator, T does not con- 
serve H exactly but introduces an error AH. It is 
expected theoretically @ g that AH grows as a 
power of the timestep dr. Verifying that this oc- 
curs with our code provides a strong test that we 
have implemented the equations of motion cor- 
rectly. 

We introduce three integration schemes; 



Tx{dr) 
T 2 (dr) 

T 3 (dr) 



= T P {dT)Tu(dT) 



(7) 
(8) 



Tp^d^TuiadTjTpi^d^TuibdT) 
Tpi^d^TuiadrjTp^dr) (9) 



where a = 1/(2 -2 1 / 3 ), and b = -2 1 / 3 /(2 - 2 1 / 3 ). 
The expectation is that T\, T 2 and T3 cause AH 
to vary as dr 2 , dr 3 and dr 5 . We find that for 
some range of dr where rounding errors do not 
dominate, the slopes of plots of log AW vs. log dr 
are respectively 1.982 ± 0.004, 3.053 ± 0.002 and 
5.056 ±0.006. Similar results were obtained using 
just the pure gauge action and the unimproved 
Wilson action. We can conclude that the equa- 
tions of motion have been correctly implemented. 

The computationally costly part of the the evo- 
lution is the evaluation of djjSf (since it involves 
the inversion of M^M) in (^J), so we split this into 



T pg (dT):P -> P-idrduSg 
T f {dr):P -> P-idrduSf 

and form the integrator 



(10) 
(11) 



T(dr) 



„ ,dr , 



n 



dr x 
2n' 



, dr , 

' , dr . 



(12) 
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Increasing n means that A7i is reduced without 
the additional expense of evaluating djjSf more 
often. We find that increasing n above 2 does not 
have a great effect on the acceptance rate, and 
indeed as the integration demands more compu- 
tational operations per timestep, the accumula- 
tion of rounding errors has a deleterious effect on 
reversibilty and eventually AH. Therefore we use 
L2|) with n < 2 in production. 



4. Estimation of K cr a 

Motivated by the idea that a dynamical gauge 
configuration behaves with some scaling be- 
haviour near a critical point one forms the ansatz 



N C g tx 



1 



1 



(13) 



where Nqq is the number of CG iterations re- 
quired to invert M'M to some given accuracy. 
Then one expects that a plot of log(iVcG) against 
log(l/K — l/K-crit) would be linear. By using var- 
ious guesses for n cr it one can estimate its true 
value as the one that yields a straight line. This 
is shown in figure [l] for the configurations at 
P = 5.2, k = 0.136, c SW = 1-72 on a 12 3 x 24 
lattice. 
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Fi gure 1. Estimation of k c 



The straightest line lies between K cr u = 0.140 
and K cr it = 0.141. so we can perform a detailed 
linear fit in this range (the result from subsequent 
spectroscopy is 0.14033(3) Q). This is a useful 
technique since many solves are performed in a 
HMC run (strictly one should only count the first 
solve of a trajectory, since this is the only one 
that is guaranteed to be done on a physical con- 
figuration) so a statistically significant value for 
Nqg is easily obtained. One then has a quick and 
reasonable accurate preliminary estimate of K C rit ■ 

5. Autocorrelation Analysis 

For the 12 3 x 24 lattice at p = 5.2 we estimate 
the following for the exponential and integrated 
plaquette autocorrelation times r exp and rj n < : 



K 


0.136 


0.137 


0.138 


0.139 


0.1395 


Texp 


17 


28 


40 


36 


50 


Tint 


20 


34 


36 


45 


64 
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